2025-01-07
Importance of human judgment and contextual knowledge
High-quality data compared with automated methods such as dictionaries
Highly time-consuming, labor-intensive, and costly
Often need to rely on a small sample of texts, which can be biased
| Machine Learning Lingo | Statistics Lingo |
|---|---|
| Feature | Independent variable |
| Label | Dependent variable |
| Labeled dataset | Dataset with both independent and dependent variables |
| To train a model | To estimate |
| Classifier (classification) | Model to predict nominal outcomes |
| To annotate | To (manually) code (content analysis) |
Licht et al. (2024) : A supervised learning workflow
Do, Ollion, and Shen (2024) : Policy vs Politics classification task
Use the test set to evaluate the performance of the model
The model is used to predict the categories of the texts in the test set
The predictions are compared to the true categories with different metrics
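The evaluation steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy corpus and its labels are invented, `train_test_split` and `train_majority_baseline` are hypothetical helper names, and the "model" is a majority-class baseline standing in for a real classifier.

```python
import random
from collections import Counter

# Hypothetical toy corpus: (text, label) pairs invented for illustration.
labeled = [
    ("We will raise the minimum wage.", "policy"),
    ("The bill cuts income tax for families.", "policy"),
    ("Hospitals will receive more funding.", "policy"),
    ("We propose a new carbon tax.", "policy"),
    ("Schools get smaller class sizes.", "policy"),
    ("Pensions will be indexed to inflation.", "policy"),
    ("The opposition is playing games.", "politics"),
    ("The minister should resign.", "politics"),
    ("This coalition will not last.", "politics"),
    ("Their campaign is falling apart.", "politics"),
]

def train_test_split(items, test_share=0.2, seed=0):
    """Shuffle a copy of the labeled items and hold out a share for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_baseline(train_items):
    """Stand-in for model training: remember the most frequent training label."""
    labels = [label for _, label in train_items]
    return Counter(labels).most_common(1)[0][0]

train, test = train_test_split(labeled, test_share=0.2, seed=42)
majority_label = train_majority_baseline(train)

# Predict on the held-out texts only, then line up predictions and true labels.
y_true = [label for _, label in test]
y_pred = [majority_label for _ in test]
comparison = list(zip(y_pred, y_true))
print(comparison)
```

The key design point is that the test texts play no role in training, so the comparison reflects performance on unseen data.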
Accuracy : proportion of correctly classified texts (highly limited for imbalanced datasets)
\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
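A small sketch of why accuracy is limited on imbalanced data. The labels and class shares are invented for illustration: a "model" that never predicts the rare class still scores high.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical imbalanced test set: 5 populist texts, 95 other texts.
y_true = ["populist"] * 5 + ["other"] * 95

# A useless classifier that always predicts the majority class.
y_pred = ["other"] * 100

print(accuracy(y_true, y_pred))  # 0.95, although no populist text is found
```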
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]
\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \]
\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
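The three formulas can be computed directly from the confusion matrix counts. A minimal sketch with invented labels; `confusion_counts` and `precision_recall_f1` are hypothetical helper names, and returning 0.0 when a denominator is zero is a convention chosen here, not a rule from the slides.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Apply the precision, recall and F1 formulas; 0.0 if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical test set: 5 populist texts followed by 95 other texts.
y_true = ["populist"] * 5 + ["other"] * 95
# The model finds 3 of the 5 populist texts and raises 1 false alarm.
y_pred = ["populist"] * 3 + ["other"] * 2 + ["populist"] * 1 + ["other"] * 94

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="populist")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)                                  # 3 1 2 94
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")        # 0.75 0.60 0.67
```

Here accuracy would be 0.97, while recall shows that 2 of the 5 populist texts are missed.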
| Problem | Solution |
|---|---|
| Unbalanced classes | Undersampling & oversampling |
| Not enough training data | More annotation |
| Bad quality of the training data | Better annotation |
| Bad quality of the text features | Better preprocessing |
| Limited text representation | Go for more complex models |
| Concept too complex | Accept okay-ish performance |
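As an illustration of the first row of the table, random oversampling can be sketched as follows. The data are invented and `random_oversample` is a hypothetical helper; it simply duplicates randomly drawn minority-class examples until every class is as large as the biggest one. Resampling is usually applied to the training set only, so that the test set keeps the real class distribution.

```python
import random
from collections import Counter

def random_oversample(items, seed=0):
    """Duplicate random examples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in items:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement to fill the gap (k is 0 for the largest class).
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced training set: 2 populist texts, 8 other texts.
train = [(f"populist text {i}", "populist") for i in range(2)]
train += [(f"other text {i}", "other") for i in range(8)]

balanced_train = random_oversample(train, seed=1)
counts = Counter(label for _, label in balanced_train)
print(counts["populist"], counts["other"])  # 8 8
```

Undersampling is the mirror image: randomly drop majority-class examples down to the size of the smallest class, at the cost of discarding annotated data.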
Convergent validity of a measure of populism
Do, Ollion, and Shen (2024)
Peterson and Spirling (2018) : accuracy as a measure of polarization
Licht et al. (2024) : predicting the use of anti-elite strategies
Sattelmayer (forthcoming) : the effect of party position on immigration on vote switching to the far right
Supervised text classification